
Introduction to Python for Data Science

Lecture 07: User-defined Functions

Student Name: Live Lecture HTML

Student ID: 220223 completed


0.1.0 About Introduction to Python

Introduction to Python is brought to you by the Centre for the Analysis of Genome Evolution & Function (CAGEF) bioinformatics training initiative. This course was developed based on feedback on the needs and interests of the Department of Cell & Systems Biology and the Department of Ecology and Evolutionary Biology.

The structure of this course is a code-along style; it is 100% hands-on! A few hours prior to each lecture, the materials will be available for download on Quercus and also distributed via email. The teaching materials will consist of a Jupyter Lab Notebook with concepts, comments, instructions, and blank spaces that you will fill in with Python code along with the instructor. Other teaching materials include an HTML version of the notebook and datasets to import into Python, when required. This learning approach will allow you to spend your time coding rather than taking notes!

As we go along, there will be some in-class challenge questions for you to solve, either individually or in cooperation with your peers. Post-lecture assessments will also be available (see the syllabus for the grading scheme and percentages of the final mark).

0.1.1 Where is this course headed?

We'll take a blank-slate approach here to Python and assume that you pretty much know nothing about programming. From the beginning of this course to the end, we want to take you from one of several potential starting scenarios:

and get you to a point where you can:


0.2.0 Lecture objectives

Welcome to this final lecture in a series of seven. We've previously covered data structures, data wrangling, exploratory data analysis, flow control, and most recently string search and manipulation, but today we will pull from all of those areas to build our own user-defined functions.

At the end of this lecture we will aim to have covered the following topics:

  1. Generating user-defined functions.
  2. Variables and their scope.
  3. Case study 1: building a transforming function.
  4. Case study 2: automating data mining.
  5. IPython magics and the rpy2 package.

0.3.0 A legend for text format in Jupyter markdown

grey background - a package, function, code, command, or directory. Backticks are also used for in-line code.
italics - an important term or concept or an individual file or folder
bold - heading or a term that is being defined
blue text - named or unnamed hyperlink

... - Within each coding cell this will indicate an area of code that students will need to complete for the code cell to run correctly.

Blue box: a key concept that is being introduced
Yellow box: a risk or caution
Green box: recommended reads and resources for learning Python
Red box: a comprehension question, which may or may not involve a coding cell. You will usually find these at the end of a section.

0.4.0 Data used in this lesson

Nothing too big today. We'll just briefly revisit a file from Lecture 05.

0.4.1 Dataset 1: sequences.tsv

This is a small example dataset that we'll use briefly to demonstrate context managers.

0.4.2 Dataset 2: taxa_pitlatrine.csv

We're bringing back one of our Lecture 04 datasets so we can reproduce some plots at the end of the lecture using some "magic".


0.5.0 Packages used in this lesson

IPython and its InteractiveShell will be accessed just to set the behaviour we want for IPython so we can see multiple code outputs per code cell.

numpy provides a number of mathematical functions as well as the special data class of arrays which we'll be learning about today.

pandas provides the DataFrame class that allows us to format and play with data in a tabular format.

re provides regular-expression matching functions similar to those found in the programming language Perl.


1.0.0 User-defined Functions

User-defined functions allow you to write your own code as functions that you and others can (re)use. You could achieve similar results by copying and pasting code on demand, but this is highly prone to errors: you may forget to replace some arguments or variables from a previous run. This may lead to serious issues that compromise the accuracy and reliability of your code and analyses.

Also, if at some point you realize that your code has an error, you have to go back to every section and script that you wrote in order to update your code. Very tedious and risky, isn't it? Therefore, user-defined or customized functions can be very handy so you do not need to copy and paste code over and over. This is known as the DRY principle: Don't Repeat Yourself.

Here is a list of some of the advantages of writing your own functions:


Disclaimer:

User-defined functions can take time to write, depending on your Python skills and the complexity of the function that you're trying to construct. Thus, as with for loops, consider writing your own functions only when there are no publicly available, ready-to-use functions to do the task that you need to carry out. Be efficient and do not reinvent the wheel! Unless you have the luxury of time and just want to write code for fun.


1.1.0 Best practices

Here is where we start, and there are plenty of good reasons to start with best practices: writing functions without following best practices is a recipe for madness. Among the best practices for writing your own functions, documenting your code should always be at the top. In the context of user-defined functions, documentation is known as a docstring: text that appears at the top of a function when ? or the help() function is called. A well-written docstring is expected to contain (at least) the following information:

1.1.1. Docstrings formats: Numpydoc

Docstrings are written within triple double-quotes """. Several docstring formats are used by the Python community, including Numpydoc, Google style, reStructuredText, and EpyText. As with variable-naming conventions, regardless of which format you decide to use, the key is to be consistent across docstrings. In other words, pick a format and stick to it. The difference between the formats is (mainly) the order in which the parts of a function are documented; other than that, their contents are very similar.

Numpydoc is the most common docstring format in Python. The documentation for the NumPy array class is an example of a Numpydoc docstring.


A Numpydoc docstring can have sections for Description, Parameters, Returns, See Also, Notes, and Examples. If some of your function's parameters are optional, you state so in the description of that parameter. For example, my_arg is boolean and optional, so its docstring entry would look like:

Parameters: my_arg: bool, optional

To retrieve just a function's docstring, you can use the .__doc__ attribute to get the raw docstring.


Passing the docstring to print() hides the markup syntax and makes it more readable.


Another option is to use the built-in inspect module, which also has many other useful functions for retrieving docstrings. Read more about this package at its documentation page.
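To make these ideas concrete, here is a small sketch: a hypothetical function (the name and docstring are ours, not from the lecture) documented in the Numpydoc format, with both retrieval approaches shown.

```python
import inspect

def count_letter(content, letter):
    """Count how many times a letter appears in a string.

    Parameters
    ----------
    content : str
        The string to search through.
    letter : str
        A single character to count.

    Returns
    -------
    int
        The number of times `letter` appears in `content`.
    """
    return content.count(letter)

# The raw docstring, indentation and all
print(count_letter.__doc__)

# inspect.getdoc() cleans up the leading whitespace for readability
print(inspect.getdoc(count_letter))
```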


1.1.2 Do One Thing - but do it well

Another best practice when writing functions is the "Do One Thing" principle: each function should do one thing; one task. Instead of one big function, you can write several small ones, one per task, without going to the other extreme of fragmenting your code into a ridiculous number of snippets. By doing one thing, your functions become:

Time to start writing our own functions.

1.2.0 Defining a function

To define a function, we need three key pieces plus the docstring:

Therefore, the structure of a function is:


1.3.0 Functions with no parameters

Let's make our first function print_square() but we'll keep it simple to start before building up to more complex functions. Without any parameters, a function will usually run without any external information and therefore will return the same output in most cases. Of course that assumes you're not making a random number generator - which we aren't!

Because this function has no parameters, the output will always be 16.
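A minimal sketch of what such a function could look like (the body is our guess at the lecture's code cell):

```python
def print_square():
    """Print the square of 4."""
    print(4 ** 2)

print_square()  # always prints 16
```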


1.4.0 Functions with a single parameter

A hard-coded function such as the one we defined above is useful on very few occasions. Adding parameters makes it more flexible, since it can alter its output based on user-provided input.

Let's redefine print_square(), but this time include a parameter value that we will use as the base of our squaring function.

This time, running print_square() without arguments will throw a TypeError.

The function now works with any integer or float...

but not strings!
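Putting those three behaviours together, a sketch of the parameterized version might be:

```python
def print_square(value):
    """Print the square of value."""
    print(value ** 2)

print_square(5)    # works with an integer: prints 25
print_square(2.5)  # works with a float: prints 6.25

# print_square() with no argument raises a TypeError (missing 'value'),
# and print_square("five") raises a TypeError too, because
# "five" ** 2 is not a supported operation
```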

A parameter is what you add to the function header. When using the function, the values you pass to those parameters are referred to as arguments.

1.4.1 Function parameters can have a default value

When we tried to run print_square() without providing a value, it generated a TypeError since we were not assigning an appropriate object to the parameter value. You may have noticed throughout this course that a number of the functions and methods we use have default values for some of their parameters. This essentially makes the parameter optional for the user, and provides a default behaviour to the function itself.

You can do the same by assigning default values to your user-defined functions too. Let's update print_square() to have a default value.
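A sketch of the updated function, assuming 4 as the default (matching the earlier output of 16):

```python
def print_square(value=4):
    """Print the square of value (4 if no argument is given)."""
    print(value ** 2)

print_square()   # falls back on the default: prints 16
print_square(3)  # the default is overridden: prints 9
```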


1.5.0 Use the return statement to pass an object reference back from your function

Instead of printing, we can return the final value of our function. The return statement both ends a function and passes back a reference to an object (integer, list, or other item) that has meaning for both humans and computers. In contrast, when we call print(), it only prints a string for the human eye; what is being printed means nothing to the program.

Let's make a new function, return_square() by replacing our print() call with a return statement.
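A sketch of that change (the internal variable name new_value matches the one discussed in the next section):

```python
def return_square(value):
    """Return the square of value."""
    new_value = value ** 2
    return new_value

return_square(4)  # the function now hands back the integer object 16
```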


1.5.1 Assign the object from your function to retain it in memory

We attempted to access new_value, which only exists within return_square(). It does not exist in memory after the function returns the reference. After this, nothing is holding onto the object, and thus it disappears from memory. In order to retain the output from return_square(), we need to assign it to a variable.
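A sketch of the assignment (the variable name result is our own):

```python
def return_square(value):
    """Return the square of value."""
    new_value = value ** 2
    return new_value

result = return_square(4)  # the returned object is now tethered to `result`
print(result)              # 16
# `new_value` itself is gone: it only existed inside the function's scope
```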


1.5.2 A function always returns an object

Recall our function print_square() which did not contain an explicit return statement. Instead it ended on a print() statement which still showed us the value of our squaring function. In the case of a function that completes without a return statement, Python will default to returning the None object.

That's right, a function will always return an object - either by explicit call or implicitly by design. Let's check back on our print_square() function.


Here's another example where we literally do nothing with our function!

Remember that the function body is a non-empty sequence of statements. In the example above, the pass statement is a null operation: when you execute it, nothing happens.
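A sketch of both cases, the implicit None return and a pass-only body:

```python
def print_square():
    print(4 ** 2)

def do_nothing():
    pass  # a null operation: the body cannot be empty, but nothing happens

output = print_square()  # prints 16 ...
print(output)            # ... yet the implicit return value is None
print(do_nothing())      # None again
```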


1.6.0 Add flow control to your functions

Functions are self-contained portions of code that can look, for the most part, like what we've been learning over the last five lectures. That means you can add loops, flow control statements, and even calls to other packages, modules, and user-defined functions within your own function.

Let's use some if and else statements in a new function.
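The function below is our own illustrative example (not necessarily the lecture's cell) of branching inside a function:

```python
def describe_number(value):
    """Classify a number as negative, zero, or positive."""
    if value < 0:
        return "negative"
    elif value == 0:
        return "zero"
    else:
        return "positive"

print(describe_number(-3))  # negative
print(describe_number(0))   # zero
print(describe_number(7))   # positive
```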


1.7.0 Functions with multiple parameters

We've already worked with many functions that take multiple parameters, and defining one is not much harder: simply include additional parameters separated by commas. Also, try to give your parameters meaningful names while you're at it.
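A sketch of a two-parameter function, using the parameter names base_value and exponent_value mentioned in the next subsection:

```python
def raise_to_power(base_value, exponent_value):
    """Return base_value raised to exponent_value."""
    return base_value ** exponent_value

print(raise_to_power(2, 3))                            # positional: 8
print(raise_to_power(exponent_value=3, base_value=2))  # keywords swapped: still 8
print(raise_to_power(3, 2))                            # positions swapped, no keywords: 9
```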


1.7.1 Assign your function arguments to specific parameters

Did you notice that we swapped the positions of the two arguments when we called raise_to_power()? It is possible to change the order of the arguments without affecting the results if you use the name of the parameters, in this case, base_value and exponent_value. If you swap the argument positions without assigning the parameters, Python will assign the arguments in the order defined by your function.

Here is a function that returns the dot product of two matrices.
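A sketch using numpy (the function and variable names are ours):

```python
import numpy as np

def dot_product(matrix_a, matrix_b):
    """Return the dot (matrix) product of two 2-D arrays."""
    return np.dot(matrix_a, matrix_b)

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])
print(dot_product(a, b))  # [[19 22]
                          #  [43 50]]
```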


1.8.0 Functions with an arbitrary number of parameters

1.8.1 Use the unpacking operator * to create an arbitrary number of positional arguments

You can write functions that take any number of arguments by adding *args as the only parameter - the syntax is the asterisk (*); args is just a word used by convention. Recall that we saw this with variables before, with the unpacking/packing operator back in Lecture 04. It's back to help pack multiple arguments into a single parameter.

Be aware that *args cannot be assigned default arguments of any kind; just use it as-is. The example below runs a for loop that calls print() on every argument passed to the function.
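A sketch of that loop (the argument values are our own stand-ins):

```python
def func(*args):
    """Print each positional argument on its own line."""
    for arg in args:
        print(arg)

func(1, "two", 3.0)  # all three arguments are packed into the tuple `args`
```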


1.8.2 Use the unpacking operator * to pass elements from an iterable

As you've probably guessed by now, you can also pass values to func() as a list or other iterable. The key to treating each element separately is to use the * unpacking operator here as well. It will break out each element of the list into a separate object assigned to *args - which itself is a tuple object.

You can also pass a predefined array preceded by an asterisk.
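A sketch of both call styles (the list contents are our own stand-in):

```python
def func(*args):
    for arg in args:
        print(arg)

bases = ["A", "C", "G", "T"]
func(*bases)  # the * operator unpacks the list into four separate arguments
func(bases)   # without *, the whole list arrives as one single argument
```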


1.8.3 Use ** to assign an arbitrary number of keyword arguments

In our example above we were using positional arguments because we were not naming any of the values explicitly in the func() call. In section 1.7.0 we assigned parameter values by their keywords.

In case you need a function with an arbitrary number of keywords, **kwargs will take care of that for you. **kwargs allows you to handle named arguments that have not been defined in advance. This means you could send an arbitrary (optional) group of named arguments to your function. This also serves the purpose of shortening the defining header for your function - especially if you plan to have a lot of parameters. Much like using *args this also provides some flexibility to your function in cases where the amount of information you need can vary by situation.

Again, the key to making this work is the ** in your function definition. kwargs is simply the standard naming convention; it is not strictly required as the variable name.


Here is a function that returns a dictionary of key:value pairs.
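A sketch of that function; the dictionary name dictionary_aminoacids comes from the text below, but its contents here are abbreviated stand-ins:

```python
def func(**kwargs):
    """Print and return the keyword arguments as a dictionary."""
    for key, value in kwargs.items():
        print(key, ":", value)
    return kwargs

func(ala="alanine", gly="glycine")

# A predefined dictionary can also be unpacked into keyword arguments with **
dictionary_aminoacids = {"ser": "serine", "thr": "threonine"}
func(**dictionary_aminoacids)
```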


1.8.3.1 Yes, **kwargs, you are a dictionary

In our above examples we created a dictionary object dictionary_aminoacids and passed that to the function with the ** operator. Is this the reason why we can treat kwargs as a dictionary object? No.

Regardless of how we pass arguments to it through our function, the **kwargs parameter is a dictionary object. That means when we try to generate an iterator from it, it follows the same rules.

  1. By default, the dictionary creates an iterator of keywords.
  2. You can pull specific iterator information from it using the keys(), values(), and items() methods.
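Both rules can be sketched in one cell (the keyword values are our stand-ins):

```python
def func(**kwargs):
    print(list(kwargs))           # iterating defaults to the keywords
    print(list(kwargs.keys()))    # the same keys, explicitly
    print(list(kwargs.values()))  # just the values
    print(list(kwargs.items()))   # (key, value) tuples

func(ala="alanine", gly="glycine")
```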

1.9.0 Lambda (In-line/Anonymous) Functions

In Lecture 05 we first encountered the lambda keyword, which creates an in-line function containing a single expression. Recall that an expression evaluates to a value; a lambda function cannot contain statements. The value of this expression is what the function returns when invoked.

Consider the following function, greeting() which we will define formally.

This particular function can also be written as a lambda function as follows:
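The greeting text below is our own stand-in; the pattern of rewriting a one-expression function as a lambda is what matters:

```python
def greeting(name):
    """Return a greeting for name."""
    return "Hello, " + name + "!"

# The same single expression as a lambda; it is assigned to a variable here
# only so we can call it (see the next section on keeping lambdas anonymous)
greeting_lambda = lambda name: "Hello, " + name + "!"

print(greeting("Ada"))         # Hello, Ada!
print(greeting_lambda("Ada"))  # Hello, Ada!
```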


1.9.1 Keep your lambda function anonymous or define a proper function

In the above example, and in Lecture 06, we saw cases where we assigned our lambda function to a variable. This was done to help set up our examples for clarity when the lambda function was especially long.

The purpose of the lambda function, however, is to remain anonymous. This is for a number of reasons, including transparency for those reading your code. Notice that a lambda function has no docstring, which means its purpose cannot simply be queried when someone is reading through your code. If you have assigned your lambda elsewhere, a reader will not easily know its purpose, whereas inserting the lambda function as intended, inline with your code, makes its purpose far easier to ascertain.

Question: What if I keep using the same or similar lambda code in different parts of my code?

Answer: Refactor your work and make a function that you can call on specifically. It will reduce potential errors and simplify your code in the long run. Do not assign your lambda to a variable.

Here is a simple anonymous function that shows clearly how lambda works.
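A one-line sketch of that doubling example:

```python
# An anonymous lambda applied to each list element via map()
print(list(map(lambda x: x * 2, [1, 2, 3, 4])))  # [2, 4, 6, 8]
```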


In our above code, x takes the value of each element in the list and multiplies it by 2. The list() and map() calls are there just to let us see the output of the in-line function. It somewhat resembles a for loop, don't you think?

The map() function takes on the form:

map(function, iterable[, iterable1, iterable2, ..., iterableN])

and applies the function object to each element of the iterable(s). This could be a pre-defined function, method, or other Python "callable" passed without the () parentheses, or it could be an anonymous lambda function as we saw above. The map() function returns an iterator of the transformed elements. This takes the place of an explicit for loop, but you still have to deal with the iterator after it's returned.

Here's an example of map() on its own.
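A sketch using the built-in round() as the callable (the numbers are our own):

```python
numbers = [1.4, 2.6, 3.0]
rounded = map(round, numbers)  # the callable is passed without parentheses
print(rounded)                 # a map object (an iterator), not a list
print(list(rounded))           # [1, 3, 3]
```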


Much like map(), the filter() function (not to be confused with pandas.DataFrame.filter()) can be used to quickly filter through a single iterable. While some objects that we've worked with can be filtered on an element-by-element basis (Numpy arrays and Pandas DataFrames), core Python iterables haven't been as easily filtered... until now!

The filter() function takes on the form:

    filter(function, iterable)

and applies function, a decision function that typically returns a boolean result. While there are other use cases for filter(), we won't get into those today. Unlike map(), the filter() function can only evaluate a single iterable, and it returns an iterator that yields only the elements for which the function result is truthy. You can then treat this iterator like any other to evaluate the elements individually or as a whole.

And yes, our function can be a lambda function too! The following code combines filter() with a lambda function to output each element of the list that starts with the letter "d".
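A sketch of that filter (the word list is our own stand-in):

```python
words = ["dna", "rna", "codon", "dimer", "protein"]
starts_with_d = filter(lambda word: word.startswith("d"), words)
print(list(starts_with_d))  # ['dna', 'dimer']
```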


Section 1.0.0 Comprehension Question: The following function comp_func appears to be lacking any coding comments! Take a moment to decipher what it does and determine what the output should be if we use a list of numbers: 2, 5, 8 and the keywords cum_sum (set to True), and product (set to False).

Answer:


2.0.0 Scope and what the kernel remembers

Up until now, we've been moving through our code without a care in the world. Recall that deep in the background, every time we declare a variable, it is assigned a place in memory. Whenever we assign an object to a variable, it acts as a tether to wherever that object exists in memory. We've called this lifeline the reference to the object. When we re-assign a variable to a new object, that original link is essentially destroyed.

In Python, and most other modern programming languages, whenever all references to an object have been destroyed, the object itself is (eventually) removed from memory. As another example, we saw instances of scope usage when declaring variables just outside a for loop. Things run slightly differently in the Jupyter Notebook, but the same principles apply.

2.1.0 The scope of variables is tied to when/where they are declared

Scope can refer to multiple things, but in Python it primarily focuses on:

  1. Namespaces - how the Python interpreter recognizes variables in different parts of the program.
  2. Memory access - how a Python object exists in memory, how it can be accessed, and how long it can remain there.

Both are somewhat intertwined within the use of scope which changes when we use functions.

In a very simplified way, functions can be thought of as separate rooms in a house or sandboxes in a playground. When we call on these functions, we are entering a clean version of these rooms which are built with some knowledge of the basic house architecture.

Thus a variable is usually either global or local in scope. If it is local, then the information about it simply disappears at the end of the function. The scope of a variable can usually be thought of as the indented block in which it was declared. After you've exited that block, anything explicitly declared within it (i.e., new variables from that section) will be released from memory unless it has been passed on in a return statement.

Of course, Python also includes an encapsulating or non-local scope that allows named variables in nested functions to be seen by their encapsulating scope (like a room within a room). There are also other keywords that allow a program to see the global scope from a local context. We can't * (unpack) all of that here.

Scope isn't just a mouthwash: Learn more about Python scoping rules here.

Why is scope important?

Understanding the basics of this concept will save you a lot of trouble down the road as you write more and more complex programs. You'll learn to avoid declaring variables in the wrong place, or trying to access ones that no longer exist in your scope. Let's revisit our examples from above.


2.2.0 You can reuse the same variable names in different scopes

We've already seen this happening in our previous examples of print_square() and return_square(). In both cases we used the same variable name value without any issue. That's because each of those functions has its own namespace and, in the way they've been coded, neither can interact with the other.

Let's try a more complex example where we are reusing a global variable inside a local context. Let's alter the value of global_value in our example from within the function by assigning it as an integer.
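A sketch of that reuse (the specific values are our own assumptions):

```python
global_value = "a string"

def local_example():
    global_value = 10  # this assignment creates a new *local* variable
    print("inside:", global_value)

local_example()                  # inside: 10
print("outside:", global_value)  # outside: a string  (the global is untouched)
```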


2.2.1 You cannot directly alter a global variable from a local context (usually)

Yes, you can read a globally declared variable from a local context and assign its value to a locally declared variable, but you cannot directly change the value of the global variable. This comes down to namespaces again and the order of operations when we try to assign a variable. Let's do one last example and see what happens when we try to update the value of global_value.
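A sketch of the failing update (the error it raises is an UnboundLocalError):

```python
global_value = 10

def update_global():
    global_value = global_value + 1  # fails: the assignment makes global_value
    return global_value              # local for the *entire* function body

try:
    update_global()
except UnboundLocalError as error:
    print("UnboundLocalError:", error)
```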


2.3.0 Alter global variables by assignment in a global scope

Unfortunately we cannot run from this error; it's a Python design choice. When Python compiles the code, it builds the function's namespace and sees global_value = ... within the function. This means it will treat global_value as a local variable, and even if we broke up the assignment or moved it around, this fact would not change.

If you really need to alter global_value you'll want to pass it in as an argument and return a value instead that you can assign directly to global_value. This is the correct way to accomplish altering a global variable within a function.
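A sketch of the pass-in-and-return approach (the values are our own):

```python
global_value = 10

def increment(value):
    """Return value + 1 so the caller can reassign it."""
    return value + 1

# The assignment happens in the global scope, so the global is truly updated
global_value = increment(global_value)
print(global_value)  # 11
```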


Now that you're ready, it's time to put it all together with some examples.

3.0.0 Case Study 1: A string transformation function

Here is an example of a user-defined function using data from Lecture 06 (Regular Expressions). The starting point was a DNA sequence.

Here we'll also use the \ as a form of line-continuation. Unlike the ''' triple-quote, this will not insert additional newlines into the string.

To translate that DNA into mRNA, we wrote a lambda function
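The lecture's actual sequence isn't reproduced here, so the short stand-in below is ours; the pattern — a backslash-continued string literal and a T-to-U substitution (shown with re.sub, assuming an re-based approach as in Lecture 06) — is the point:

```python
import re

# A short stand-in coding sequence; the backslash continues the string
# literal onto the next line without inserting a newline character
dna = "ATGGCC" \
      "TTTTAA"

# Transcribe DNA to mRNA by swapping every thymine (T) for uracil (U)
to_mrna = lambda seq: re.sub("T", "U", seq)
print(to_mrna(dna))  # AUGGCCUUUUAA
```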


3.1.0 Reviewing the structure of our code

Let's break down the lambda function above:

After translation into mRNA, we further translated the ribonucleotide sequence into amino acids based on their codons. For that purpose, we created a dictionary with codons as keys and encoded amino acids as values.


3.2.0 Convert mRNA sequence to amino acids

Last week we used a couple of different methods to convert our mRNA sequence to amino acids; in both instances we used list comprehension to achieve our goal. Now let's put together a user-defined function that we can call on from anywhere in the program.

We'll call it translate() and it will have two parameters:

Within our function we'll also use two re functions we didn't cover in class last week:
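Here is a sketch of what translate() might look like, assuming re.findall() is one of the two re functions and using a heavily truncated stand-in codon dictionary:

```python
import re

# A tiny stand-in for the full codon dictionary built earlier
dictionary_codons = {"AUG": "M", "GCC": "A", "UUU": "F", "UAA": "*"}

def translate(sequence, codon_dict):
    """Translate an mRNA sequence into a one-letter amino acid string."""
    codons = re.findall("...", sequence)  # chunk into groups of three bases
    return "".join(map(lambda codon: codon_dict[codon], codons))

print(translate("AUGGCCUUUUAA", dictionary_codons))  # MAF*
```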


3.2.1 Reviewing the translate() function

Here is what the function translate() does:

Overall the code does not seem as obscure as it looked the first time you saw it, right? Admittedly, it is still a bit of a Rube Goldberg machine, code-wise. Here's a cleaner version of that code:


That's a much cleaner function now.

Section 3.0.0 Comprehension Question: Write a Numpydoc format docstring for our translate() function and update the function itself to use the dict.get() command in the lambda section of our function. Use the code cell below to create the update.

4.0.0 Case Study 2: Automating data mining

As part of your new job, you download a .csv file from a database twice a day. These files are compressed, so you need to extract them before use. Historically, staff have downloaded these files manually, but you identify this as something that can be easily automated, so you decide to write a function to do the job for you and your team.

First, let's write the code for each task:

Our target file is hosted at https://raw.githubusercontent.com/ageron/handson-ml2/master/datasets/housing/housing.tgz

4.0.1 Create a download folder

We'll begin by creating a download folder using the os package. We can use os.makedirs to create all non-existent folders within the path we want to make.
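A sketch of that step (the folder names are illustrative, chosen to match the dataset's URL path):

```python
import os

# Illustrative folder names matching the dataset's URL path
download_path = os.path.join("datasets", "housing")
os.makedirs(download_path, exist_ok=True)  # exist_ok avoids an error on re-runs
print(os.path.isdir(download_path))        # True
```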


4.0.2 Download the data

We'll now create a string path to the website where we want to download from. We'll use the urlretrieve() function from the urllib package to help pull down the data.


4.0.3 Extract the file

Now that we've downloaded the file to our new directory, we can go ahead and extract it from the .tgz container. We'll import another package, tarfile to help us open and extract the file. To do so we'll need to open() a connection to the file and then extractall() of its contents.
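Since we can't download in these notes, the sketch below builds a tiny stand-in .tgz locally first, then demonstrates the open()/extractall() pattern on it:

```python
import os
import tarfile

# Build a tiny stand-in archive so the extraction step can run offline
os.makedirs("demo", exist_ok=True)
with open(os.path.join("demo", "data.csv"), "w") as handle:
    handle.write("a,b\n1,2\n")
with tarfile.open(os.path.join("demo", "data.tgz"), "w:gz") as archive:
    archive.add(os.path.join("demo", "data.csv"), arcname="data.csv")

# The pattern itself: open() a connection to the container, then extractall()
with tarfile.open(os.path.join("demo", "data.tgz")) as archive:
    archive.extractall(path=os.path.join("demo", "extracted"))

print(os.listdir(os.path.join("demo", "extracted")))  # ['data.csv']
```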


4.0.4 Load the extracted data

Now that the extracted data is sitting there, we can load it using the functions from pandas so we can work with it as a DataFrame.


4.1.0 Combine our steps to make some functions

Now you have a script that downloads the data and loads it into the environment. Let's split those four code blocks into two functions: one to fetch the data and another to load the data.

Let's start by fetching the data with a function named fetch_data()

Next, create the function that loads the data
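A sketch of the pair; the archive and .csv file names are assumptions based on the URL, and nothing is downloaded when the cell merely defines the functions:

```python
import os
import tarfile
import pandas as pd
from urllib.request import urlretrieve

DATA_URL = ("https://raw.githubusercontent.com/ageron/handson-ml2/"
            "master/datasets/housing/housing.tgz")

def fetch_data(url=DATA_URL, path=os.path.join("datasets", "housing")):
    """Download the .tgz archive to path and extract its contents there."""
    os.makedirs(path, exist_ok=True)
    tgz_path = os.path.join(path, "housing.tgz")
    urlretrieve(url, tgz_path)               # download the compressed file
    with tarfile.open(tgz_path) as archive:  # then extract it in place
        archive.extractall(path=path)

def load_data(path=os.path.join("datasets", "housing")):
    """Read the extracted .csv into a pandas DataFrame."""
    return pd.read_csv(os.path.join(path, "housing.csv"))

# Running fetch_data() and then load_data() downloads, extracts, and loads
```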


4.2.0 Visualize our data as histograms

All of our functions are completed now so let's plot some of the data.


4.3.0 Render a map using latitude and longitude values

This housing dataset contains latitude and longitude variables, so let's render it as a map, using the median_house_value and population variables to see which areas are the most populated (warmest colors).

So, from now on, all you need to do to download and work on your data is define the link to the data warehouse and the directories you want to access, then run fetch_data() and load_data().


5.0.0 Thinking beyond Python

Throughout this entire course, we've been focused on the basics of Python. To review, the topics we've covered include:

  1. Basic Python objects, packages, and operations.
  2. Python data structures: lists, dictionaries, arrays, and DataFrames.
  3. Python data formatting and wrangling.
  4. Exploratory data analysis and visualizations.
  5. String manipulation and matching with regular expressions.
  6. Flow control with looping and branching code.
  7. Creating your own user-defined functions.

All of these topics help to build a foundation for analysing your data but also provide you with an understanding of the Python language so you can move forward with learning new packages that are relevant to your own research. For instance, knowing what a list or tuple is, or how to unpack an iterable, will simplify learning new coding paradigms.

In this last section, we'll look beyond Python for those with experience in, or plans to work with, the programming language R, or with command-line programs for larger sequence-analysis steps.


5.1.0 Run R code in Python with rpy2

While Python is very dynamic, with a wide number of packages available, sometimes you might feel more comfortable working with something familiar. For instance, the ggplot2 package is the de facto visualization package in R. Additional packages are built with ggplot2 as a basis, expanding on its grammar-of-graphics style. There is a relatively recent Python version of this package called plotnine. Comparing the two, there is an active community of over 250 contributors working on ggplot2 to update and improve it on a regular basis, while plotnine is developed and maintained by a single individual.

Taking that into account, you may want to visualize your data with R rather than Python! The same could be said for other standard bioinformatics packages from R whose functionality might be useful, like DESeq2 for differential expression analysis of RNA-Seq data.

Rather than splitting your time between two programming languages, and their development environments, you can run it all from a single script with the rpy2 package. Let's look at the two ways we can use this package to our advantage.

5.1.0.1 Import taxa_pitlatrine.csv

Before we jump into working with rpy2, let's bring in some interesting data to plot for our example. We'll bring back our friend taxa_pitlatrine.csv, which we worked with back in Lecture 04. The following code cell will import and filter the data just as we did back then.


5.1.1 Use rpy2 to import packages from the R library

To begin with, the rpy2 package provides a function importr that allows you to import R packages for use in Python by assigning them to variables/aliases. For everything to work, we'll actually need to import a number of packages and modules. More specifically we'll be importing:


5.1.2 Converting pandas.DataFrame to R data.frame

While the pandas and base R versions of data frame objects share significant overlap in behaviour and design, they are still distinct object types. In order to use R packages, we'll have to convert from one to the other. This itself involves two additional modules. For the purpose of keeping our code concise, we will import only the specific modules we want:


5.1.3 Import packages required for plotting your data

Now that we have our data converted to an R-based data.frame we can proceed to plotting it. Here we'll import the packages we'll need to plot the data and display it:


5.1.4 Plot your data using ggplot2

Now that we've set up all of the necessary components, we can go ahead and plot our data. We'll do this again within the confines of a context manager grdevices.render_to_bytesio(). This will ensure that, should something go wrong, the various connections created will be quietly closed without creating additional memory leaks.

In our context manager, we'll set the width and height of our image as well to produce an object called img that will hold the resulting plot information.

You'll also notice in our code that any ggplot2 functions normally loaded into memory in R, must be called with dot (.) notation in the normal Python style.


5.2.0 Using IPython magic to work with R

What a wild ride! We used import to bring in or alias eight different modules and components to make our graph, and many of those were used just to display our plot in the Jupyter notebook. While this will certainly not be the case within a simpler Python script, there are still many nuances involved in making this work.

Under the hood of the Jupyter Notebook, there are a number of IPython-specific magics. These are essentially syntactic shortcuts beginning with % that allow you to perform quick operations or directives involving your code. In our case, what we are really interested in is making our use of R in Python just a little simpler.

5.2.1 Load an IPython extension with %load_ext

The IPython extensions are just what they sound like. Recall that our Jupyter Notebooks are built on an IPython shell, which controls how we access Python and output the code we run. We have already seen an example of modifying this shell via IPython.core.interactiveshell. Extensions add extra modifications to the behaviour of this shell, and we can add more using the magic command %load_ext <extension_name>.

To gain access to the rpy2 package and its capabilities, we can use the extension rpy2.ipython.
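In a notebook cell, loading the extension is a one-liner (assuming rpy2 is installed):

```
%load_ext rpy2.ipython
```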


5.2.2 Run R in-line with your Python code via %R

Sometimes you may wish to simply slip in some R code between calculations or lines in your Python code. Perhaps you want to access a package and generate a quick calculation in R. Do this with %R to let Python know your intent. Along with this magic comes a series of possible parameters including:
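For example, in a notebook cell after loading rpy2.ipython (the flags in the comment are hedged from rpy2's documentation):

```
# -i pushes a Python variable into R; -o pulls an R variable back out;
# -w and -h set figure width and height
result = %R mean(c(1, 2, 3))
print(result)
```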


5.2.3 Convert whole code cells to R using %%R

When you are generating or running large portions of code, you can convert the whole code cell to work through rpy2 using the %%R magic. It works with the same parameters as %R to simplify the process.

Let's use this to create a code cell where we plot our data as above. Since we are plotting, we'll set the width and height of our plotting device at the same time.

You'll see, as well, just how simple the process of creating our plot is when using the R magics: no additional Pythonic coding with packages, just coding directly as you would (for the most part) in an R setting. The biggest difference here is that we will also provide our pandas.DataFrame object, latrine_filtered, as input. It will be automatically converted by rpy2 for us.
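A sketch of such a cell (the column names here are hypothetical placeholders):

```
%%R -i latrine_filtered -w 800 -h 600
# latrine_filtered arrives from Python, auto-converted to an R data.frame
library(ggplot2)
ggplot(latrine_filtered, aes(x = Temp, y = pH)) +  # hypothetical columns
    geom_point()
```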


The magic was in You(r IPython Jupyter Notebook) all along: While we've covered just how to run R through the magics, you can actually convert code cells to run all sorts of languages - bash scripts, perl, Java and more! There are also many helpful shortcuts for quickly navigating different aspects of your Jupyter Notebook. You can find out more in the available documentation.

5.3.0 Is this the end?

Prequel_ClassEnd_meme.jpg

Yes, our journey is over - for this class. Unfortunately we left a lot out of this course that you might find useful! We barely covered topics like:

And we didn't even touch on things like:

... so keep on learning!


6.0.0 Class summary

That's our seventh and final class on Python! We worked through a lot of important steps for building and using our own functions. To summarize, we've covered:

  1. Creating user-defined functions.
  2. Scoping your variables.
  3. Decoding user-defined functions from previous lectures.
  4. Building your own user-defined function to automate tasks.
  5. Using rpy2 to combine Python and R.

Best of luck with your data science journey!!

6.1.0 Submit your completed skeleton notebook (2% of final grade)

At the end of this lecture a Quercus assignment portal will be available to submit your completed skeletons from today (including the comprehension question answers!). These will be due one week later, before the next lecture. Each lecture skeleton is worth 2% of your final grade, but a bonus 0.7% will also be awarded for submissions made within 24 hours of the end of lecture (i.e. 17:00 the following day).

6.2.0 Post-lecture DataCamp assessment (8% of final grade)

Soon after the end of this lecture, a homework assignment will be available for you in DataCamp. Your assignment is to complete chapters 1-3 (Best Practices, 750 possible points; Context Managers, 800 possible points; and Decorators, 1050 possible points) from the Writing Functions in Python course. This is a pass-fail assignment, and in order to pass you need to achieve at least 1950 points (75% of the total possible points). Note that when you take hints from the DataCamp chapter, it will reduce your total earned points for that chapter.

In order to properly assess your progress on DataCamp, please take a screenshot of the summary at the end of each chapter. You'll find this under the "Course Outline" menu at the top of the page for each course. It should look something like this, showing the total points earned in each chapter:

DataCamp.example.png
A sample screenshot for one of the DataCamp assignments. You'll want to combine yours into a single image or PDF if possible

Submit the file(s) for the homework to the assignment section of Quercus. This allows us to keep track of your progress while also producing a standardized way for you to check on your assignment "grades" throughout the course.

You will have until 13:59 hours on Thursday, March 3rd to submit your assignment (right before the next lecture).


6.3.0 Final assignment guidelines (30% of final grade)

Your final project will be due two weeks after this lecture at 13:59 hours on Thursday March 10th. Please submit your final assignment as a single compressed archive which will include:

  1. Your Jupyter Notebook final project
  2. A PDF version of the notebook with all output from the code cells. This will be used for markup and comments that I can return to you about your projects.
  3. Any associated data needed to run your project. When you create your compressed file for submission, you can preserve the folder structure by compressing the entire folder with the needed files.
  4. KEEP YOUR INPUT/DATA SUBMISSION UNDER 100 MB. These can be used to help re-run your code if inconsistencies are identified, but please generate a subset of your data if it is excessively large.

JupyterHub.SaveDir.png

Please refer to the marking rubric found in your lecture notes for additional instructions and FAQs.

You can build your Jupyter Notebooks on the UofT JupyterHub and save/download a compressed archive to your personal computer before submitting on Quercus.

Any additional questions can be emailed to me or posted to the Discussion section of Quercus. Best of luck!


6.4.0 Acknowledgements

Revision 1.0.0: materials prepared by Oscar Montoya, M.Sc. Bioinformatician, Education and Outreach, CAGEF.

Revision 1.1.0: edited and prepared for CSB1021H S LEC0140, 06-2021 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.

Revision 1.2.0: edited and prepared for CSB1021H S LEC0140, 01-2022 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.


6.5.0 Resources


7.0.0 Appendix 1: context managers

In the context of scope and memory, context managers allow you to create variables and manipulate objects in a way that ensures their memory footprint doesn't pollute your global environment. Problems can arise when Python opens a connection or generates a resource, but you then forget to close it, or an exception or error is encountered partway through that code block. In those cases, you may inadvertently "lose" the connection without it being destroyed by Python. This lingering connection produces a "memory leak", as it cannot be properly removed from the program's memory. It may also prevent you from re-opening the same connection or altering a file later outside of Python.

There are two main ways of dealing with this: the try/finally exception structure or a context manager. We won't go over all of the details, but in both cases, when you wish to use a resource you usually:

  1. Allocate/initialize the resource.
  2. Read/Write/Transform the resource.
  3. Close or de-allocate the resource.

When your code runs, an error at step 2 may skip step 3 altogether. The try/finally statements encapsulate your code to ensure that even if an error occurs while trying step 2, the resource will still be closed when you are finally done.
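The three steps, protected with try/finally, can be sketched like this (writing to a temporary location so the example is self-contained):

```python
import os
import tempfile

path = os.path.join(tempfile.gettempdir(), 'example.txt')

f = open(path, 'w')       # 1. allocate/initialize the resource
try:
    f.write('some data')  # 2. use it; an error here would otherwise skip the close
finally:
    f.close()             # 3. always runs, even if step 2 raised
print(f.closed)           # True
```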

Similarly, a context manager is a pre-formed version of this, where common Python actions are coded to catch errors and "recover gracefully". Context managers avoid having to produce additional complicated code and take the form:
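In general, the pattern looks like this (pseudocode):

```
with context_manager(arguments) as target:
    # use the resource here
# on leaving the indented block, cleanup happens automatically
```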


The keyword with signals to Python the use of a context manager. If the context manager produces an object, like a file connection, the optional keyword as assigns the resulting object to the variable target.

Here is an example of the context manager open to import a file and perform some operations on it.
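A self-contained sketch using a temporary file (the filename is a stand-in):

```python
import os
import tempfile

path = os.path.join(tempfile.gettempdir(), 'demo.txt')
with open(path, 'w') as out_file:  # write a small file to read back
    out_file.write('line one\nline two\n')

# open() as a context manager: the file closes as soon as we dedent
with open(path) as in_file:
    lines = in_file.readlines()

print(len(lines))      # 2
print(in_file.closed)  # True
```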


7.0.1 The context manager cleans up for you after leaving the indentation group

In the code above, when we leave the indentation group and print out our information, the context manager closes the file connection under the hood. If for some reason we called an incorrect method, we might get back a warning or exception that could end the program. Before the kernel exits the program, however, the context manager would execute its __exit__ method, which includes shutting down the file connection. The file is therefore no longer occupying space in memory!


7.1.0 Defining context managers with your own function

Context managers are pre-made functions that have certain assumptions about the order of how actions should be performed. While some are relatively simple like the open command, you may find yourself in a situation where you have specific needs.

You can use context managers to manage paired action patterns that you use often like:

You may be building a function where you expect a certain data format or style. To create your own context manager, define it like you would define a function. Within the function, use the keyword yield, optionally followed by any code that would help clean up the context.

Cleanup code can be a piece of code that, for example, disconnects from a remote database after you are finished using it. An easy way to build context managers is with the @contextlib.contextmanager decorator right above the context manager.

Decorator syntax: In Python, the decorator syntax @function_B placed above function_A is a fancy way of saying "pass function_A to function_B to run inside function_B." Whenever you call function_A, the call is redirected through function_B, which typically extends the behaviour of function_A. This action is also known as wrapping.

Let's take a look at the format.
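A minimal sketch of the format (the function name and recorded messages are hypothetical):

```python
import contextlib

events = []  # record the order of operations

@contextlib.contextmanager
def my_context():
    events.append('setup')    # runs on entering the with-block
    yield 'a value'           # hands control (and a value) to the block
    events.append('cleanup')  # runs after the block finishes

with my_context() as value:
    events.append(value)

print(events)  # ['setup', 'a value', 'cleanup']
```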


7.1.1 The yield statement is where the cleanup happens

The yield statement tells Python that you want to return a value but that you will finish the rest of the function afterwards. More importantly, this statement saves the state of the local variables. Remember what we talked about with scoping? When a function yields, its scope and state are saved, and returning to the function picks back up where yield was called. For a context manager, we include this call only once. The contextmanager wrapper is coded specifically to look for this structure to initiate your function's cleanup.

Here is a more useful example of how to use a context manager. Let's say that you have set your working directory, but sometimes you need to access files outside your current working directory. Let's write a context manager that changes the current working directory, performs a task, and changes the path back to your previously set working directory.

Let's see the context manager in action!
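A sketch of in_dir in action, matching the description above and using the system temporary directory as a stand-in for an outside path:

```python
import contextlib
import os
import tempfile

@contextlib.contextmanager
def in_dir(path):
    old_dir = os.getcwd()  # remember where we started
    os.chdir(path)         # step into the requested directory
    try:
        yield              # nothing is returned, so the target gets None
    finally:
        os.chdir(old_dir)  # always step back, even after an error

start = os.getcwd()
with in_dir(tempfile.gettempdir()) as lec_5:
    contents = os.listdir()  # listing of the temporary directory

print(lec_5)                 # None
print(os.getcwd() == start)  # True
```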


7.1.2 Your context manager can return a None object

From our example above, you can see that our in_dir() context manager returns the None object when we assign it to the variable lec_5. This is apparent in the context manager code itself, which takes in a path and uses the os package to change the directory. Nothing is returned, so None is assigned to lec_5 afterwards. We could pass it to our call as os.listdir(lec_5) instead of os.listdir(), but it isn't necessary.

More importantly, the context manager works and your current working directory is intact!


The Centre for the Analysis of Genome Evolution and Function (CAGEF)

The Centre for the Analysis of Genome Evolution and Function (CAGEF) at the University of Toronto offers comprehensive experimental design, research, and analysis services in microbiome and metagenomic studies, genomics, proteomics, and bioinformatics.

From targeted DNA amplicon sequencing to transcriptomes, whole genomes, and metagenomes, from protein identification to post-translational modification, CAGEF has the tools and knowledge to support your research. Our state-of-the-art facility and experienced research staff provide a broad range of services, including both standard analyses and techniques developed by our team. In particular, we have special expertise in microbial, plant, and environmental systems.

For more information about us and the services we offer, please visit https://www.cagef.utoronto.ca/.

CAGEF_new.png